Unsupervised learning of word separators with MDL
Authors
Abstract
This paper describes a novel algorithm for the unsupervised learning of word separators in raw text. The algorithm requires no language-specific knowledge regarding the text being processed. It relies solely on distributional properties of the text and uses the minimum description length (MDL) principle in order to partition characters into two subsets that correspond well with the traditional notion of letters and separators. The distinction between these types of characters emerges as an optimal solution to the problem of simultaneously compressing two elements: the lexicon that is obtained by tokenizing the text using the hypothesized separators, and the representation of the text under this lexicon. The performance of the proposed algorithm is evaluated on the basis of electronic text in English, French and German.
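The two-part cost described in the abstract can be illustrated with a minimal sketch. It is not the paper's coding scheme: the function names, the `<end>` marker, and the ideal entropy-code costs are all assumptions made for illustration. Given a hypothesized separator set, it tokenizes the text and sums the bits needed to spell out the lexicon, to encode the token sequence under that lexicon, and to encode the separator characters themselves.

```python
import math
from collections import Counter


def tokenize(text, separators):
    """Split text into maximal runs of characters not in `separators`."""
    tokens, current = [], []
    for ch in text:
        if ch in separators:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def code_length_bits(counts):
    """Total bits to encode every counted symbol with an ideal entropy code."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-c * math.log2(c / total) for c in counts.values() if c > 0)


def description_length(text, separators):
    """Hypothetical two-part cost: lexicon + text under the lexicon."""
    tokens = tokenize(text, separators)
    types = set(tokens)

    # Lexicon cost: spell each distinct token once, character by character,
    # followed by an assumed end-of-word marker.
    lexicon_symbols = Counter(ch for word in types for ch in word)
    if types:
        lexicon_symbols["<end>"] = len(types)
    lexicon_bits = code_length_bits(lexicon_symbols)

    # Text cost: encode the token sequence using token frequencies.
    corpus_bits = code_length_bits(Counter(tokens))

    # Separator cost (a simplification): encode each separator occurrence.
    separator_bits = code_length_bits(
        Counter(ch for ch in text if ch in separators)
    )
    return lexicon_bits + corpus_bits + separator_bits


if __name__ == "__main__":
    sample = "the cat and the dog and the cat " * 50
    for candidate in (set(), {" "}):
        print(sorted(candidate), round(description_length(sample, candidate), 1))
```

A search over candidate separator sets would keep the partition with the smallest total. On a tiny, highly repetitive sample such as the one above, other characters may score as well as the space, so a comparison of this kind is only meaningful on a large corpus.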
Similar resources
Unsupervised Word Induction Using MDL Criterion
Unsupervised learning of units (phonemes, words, phrases, etc.) is important to the design of statistical speech and NLP systems. This paper presents a general source-coding framework for inducing words from natural language text without word boundaries. An efficient search algorithm is developed to optimize the minimum description length (MDL) induction criterion. Despite some seemingly over-s...
Can MDL Improve Unsupervised Chinese Word Segmentation?
It is often assumed that Minimum Description Length (MDL) is a good criterion for unsupervised word segmentation. In this paper, we introduce a new approach to unsupervised word segmentation of Mandarin Chinese that leads to segmentations whose Description Length is lower than what can be obtained using other algorithms previously proposed in the literature. Surprisingly, we show that this lower...
A Goodness Measure for Phrase Learning via Compression with the MDL Principle
This paper reports our ongoing research on unsupervised language learning via compression within the MDL paradigm. It formulates an empirical information-theoretical measure, description length gain, for evaluating the goodness of guessing a sequence of words (or character) as a phrase (or a word), which can be calculated easily following classic information theory. The paper also presents a be...
Unsupervised Lexical Learning As Inductive Inference via Compression
This paper presents a learning-via-compression approach to unsupervised acquisition of word forms with no a priori knowledge. Following the basic ideas in Solomonoff’s theory of inductive inference and Rissanen’s MDL framework, the learning is formulated as a process of inferring regularities, in the form of string patterns (i.e., words), from a given set of data. A segmentation algorithm is de...
Fully Unsupervised Word Segmentation with BVE and MDL
Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to gene...